feat: Add phase_status tracking and PARTIAL apply status #1690
Conversation
Add per-phase execution status tracking to manifest versions for partial failure recovery. When a manifest apply fails mid-execution, the system now records which phases completed, which failed, and which were skipped.

Changes:
- Migration: Add `phase_status` JSONB column to `manifest_versions`
- Migration: Add `PARTIAL` to `apply_status` CHECK constraint (separate file for CockroachDB compatibility)
- Domain: `PhaseStatusMap`, `PhaseStatusEntry` types with JSON serialization
- Repository: `UpdatePhaseStatus` method, `GetPhaseStatus`/`SetPhaseStatus` on `VersionEntity`
- Proto: `PhaseStatusDetail` message, `phase_status` map on `ManifestVersion` and `ApplyManifestResponse`, `APPLY_STATUS_PARTIAL` and `APPLY_MANIFEST_STATUS_PARTIAL` enum values
- gRPC handler: Derives phase status from execution plan phases and saga step results, classifies partial vs complete failure
- Tests: Unit tests for classification, phase status building, and round-trip serialization; integration tests for DB persistence
📝 Walkthrough

This change introduces phase-level status tracking for manifest application: proto additions for phase details and partial status, runtime phase tracking and classification in the applier, persistence of phase status in history/repository, DB migrations for a JSONB `phase_status` column and a `PARTIAL` apply state, and tests covering the new behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant Handler as ApplyManifestHandler
    participant Planner as ExecutionPlanner
    participant Executor as Executor
    participant StatusMgr as PhaseStatusManager
    participant History as HistoryService
    participant DB as Database
    Client->>Handler: ApplyManifest(request)
    Handler->>Planner: build execution plan
    Handler->>Handler: buildInitialPhaseStatus(plan)
    Handler->>Executor: Execute plan
    Executor-->>Handler: ApplyManifestResult (steps success/fail)
    Handler->>StatusMgr: updatePhaseStatus(plan, result, execErr)
    alt Partial failure detected
        StatusMgr->>StatusMgr: findFailedPhase(...)
        StatusMgr->>StatusMgr: mark phases COMPLETED/FAILED/SKIPPED, set timestamps/errors
    else All success
        StatusMgr->>StatusMgr: mark all phases COMPLETED
    end
    StatusMgr-->>Handler: phaseStatus map
    Handler->>Handler: classifyFailure(phaseStatus) -> (ApplyStatus, ManifestStatus)
    Handler->>History: recordHistoryWithPhaseStatus(manifest, phaseStatus, ...)
    History->>DB: Store manifest version with phase_status JSONB
    DB-->>History: persist result
    History-->>Handler: stored ManifestVersion proto
    Handler->>Handler: phaseStatusMapToResponseProto(phaseStatus)
    Handler-->>Client: ApplyManifestResponse (status + phase_status)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 5
🧹 Nitpick comments (1)
services/control-plane/internal/manifest/repository.go (1)
79-105: Keep `UpdatePhaseStatus` on the typed `PhaseStatusMap` API.

`UpdatePhaseStatus` re-exposes raw JSON even though `SetPhaseStatus` already owns serialization in this package. Keeping the repository API typed avoids duplicated marshaling and malformed writes leaking into callers.

Suggested shape

```diff
-func (r *Repository) UpdatePhaseStatus(ctx context.Context, id uuid.UUID, phaseStatusJSON *string) error {
+func (r *Repository) UpdatePhaseStatus(ctx context.Context, id uuid.UUID, phaseStatus PhaseStatusMap) error {
+	var entity VersionEntity
+	if err := entity.SetPhaseStatus(phaseStatus); err != nil {
+		return err
+	}
 	return db.WithGormTenantTransaction(ctx, r.db, func(tx *gorm.DB) error {
-		result := tx.Model(&VersionEntity{}).Where("id = ?", id).Update("phase_status", phaseStatusJSON)
+		result := tx.Model(&VersionEntity{}).Where("id = ?", id).Update("phase_status", entity.PhaseStatus)
 		if result.Error != nil {
 			return fmt.Errorf("update phase_status: %w", result.Error)
 		}
```

Also applies to: 264-276
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/manifest/repository.go` around lines 79 - 105, The UpdatePhaseStatus method should operate on the typed PhaseStatusMap rather than raw JSON: change UpdatePhaseStatus to accept/return PhaseStatusMap (not []byte/string), call VersionEntity.GetPhaseStatus to read current map, mutate it, then use VersionEntity.SetPhaseStatus to serialize and store it, avoiding direct json.Marshal/json.Unmarshal in the repo layer; update any callers to pass a PhaseStatusMap and remove any duplicated marshaling logic (also apply the same change to the other occurrence referenced around lines 264-276).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/control-plane/internal/applier/executor.go`:
- Around line 210-211: The comment on ApplyManifestResult.Status advertises a
"partial" status that Apply never returns; update the documentation for
ApplyManifestResult.Status (and any related comment blocks) to only list the
statuses actually produced by the Apply function (e.g., "applied" and "failed"),
or explicitly note that "partial" is produced elsewhere (by phase analysis) if
you prefer to keep it — locate the ApplyManifestResult type and the Apply
function references to make the change so the docstring accurately matches
runtime behavior.
In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 526-559: The current logic marks the entire failed phase as FAILED
even if earlier calls in that same phase succeeded; modify the update that runs
when failedPhase == p (in the loop that updates phaseStatus using
findFailedPhase, and the analogous block around
classifyFailure/updatePhaseStatus later) to inspect the step-level results
(e.g., result.StepResults or whatever structure holds per-step statuses) for the
failedPhase: if any step in that phase succeeded, treat the phase as having
partial work — set the phase status to manifest.PhaseStatusCompleted (or to an
existing "partial" status if your manifest has one), set CompletedAt = &now and
populate Error from result.Error or execErr as currently done; only mark it
strictly FAILED when no step in that phase completed successfully. Ensure you
update both occurrences (the loop using phases/phaseStatus and the similar block
at lines ~630-649) and reference findFailedPhase, phaseStatus, phases, result,
execErr and classifyFailure when making the change.
- Around line 184-190: The code currently calls
recordHistoryWithPhaseStatus(..., expectedSeq=0) and discards the error,
allowing stale requests to append FAILED/PARTIAL rows; change this to use the
same optimistic-sequence logic as the success path: pass the correct expected
sequence (the same sequence value used when writing successful history) instead
of 0, capture the returned error from recordHistoryWithPhaseStatus, and handle
ErrSequenceConflict the same way the success path does (i.e., detect sequence
conflict and return/translate it to the same conflict response or retry logic).
Locate recordHistoryWithPhaseStatus, classifyFailure, and the
execResult.jobID/phaseStatus usage to implement the change so failed/partial
history writes respect optimistic locking and surface the error.
In `@services/control-plane/internal/manifest/history.go`:
- Around line 281-285: The code currently swallows errors from
entity.GetPhaseStatus() causing malformed phase_status to be omitted; change the
block in EntityToProto (or the function where this snippet lives) to propagate
the error instead of ignoring it: if entity.GetPhaseStatus() returns a non-nil
err, wrap it with context (e.g., "invalid phase_status for manifest <id>" or
similar) and return the error from EntityToProto (or bubble it up) so callers
see the failure; preserve the existing mv.PhaseStatus assignment when err == nil
and phaseStatus != nil.
- Around line 152-157: The code currently ignores errors from
s.repo.UpdatePhaseStatus after calling entity.SetPhaseStatus, resulting in a
false success (PARTIAL) even when phase_status persistence fails; modify the
logic in the function that calls SetPhaseStatus and s.repo.UpdatePhaseStatus so
that you capture the error returned by s.repo.UpdatePhaseStatus and propagate it
(or convert it to an appropriate failure result) instead of discarding it: call
setErr := entity.SetPhaseStatus(phaseStatus); if setErr == nil { if updErr :=
s.repo.UpdatePhaseStatus(ctx, entity.ID, entity.PhaseStatus); updErr != nil {
return updErr (or mark snapshot as failed) } } so callers observe persistence
failures from UpdatePhaseStatus rather than a silent success.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bedc0ac2-0bc5-469c-bbc7-040a889fac1a
⛔ Files ignored due to path filters (1)
- `services/control-plane/migrations/atlas.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (10)
- api/proto/meridian/control_plane/v1/apply_manifest_service.proto
- api/proto/meridian/control_plane/v1/manifest_history_service.proto
- services/control-plane/internal/applier/executor.go
- services/control-plane/internal/applier/grpc_handler.go
- services/control-plane/internal/applier/grpc_handler_test.go
- services/control-plane/internal/manifest/history.go
- services/control-plane/internal/manifest/repository.go
- services/control-plane/internal/manifest/repository_test.go
- services/control-plane/migrations/20260316000002_add_phase_status.sql
- services/control-plane/migrations/20260316000003_add_partial_apply_status.sql
```go
// Determine if this is a partial failure (some phases succeeded)
applyStatus, responseStatus := classifyFailure(execResult.phaseStatus)
response.Status = responseStatus
response.PhaseStatus = phaseStatusMapToResponseProto(execResult.phaseStatus)

// Record history with phase status
_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
```
Don't bypass optimistic locking on failed/partial history writes.
Line 190 hardcodes expectedSeq=0 and drops the returned error, so a stale request can still append a FAILED/PARTIAL history row after another manifest version has already advanced the sequence. This should use the same sequence check and ErrSequenceConflict handling as the success path.
Suggested fix

```diff
-	_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
+	if _, err := h.recordHistoryWithPhaseStatus(
+		ctx,
+		req.GetManifest(),
+		req.GetAppliedBy(),
+		execResult.jobID,
+		applyStatus,
+		nil,
+		req.GetExpectedSequenceNumber(),
+		execResult.phaseStatus,
+	); err != nil && errors.Is(err, manifest.ErrSequenceConflict) {
+		return nil, status.Errorf(codes.Aborted, "%v", err)
+	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/control-plane/internal/applier/grpc_handler.go` around lines 184 -
190, The code currently calls recordHistoryWithPhaseStatus(..., expectedSeq=0)
and discards the error, allowing stale requests to append FAILED/PARTIAL rows;
change this to use the same optimistic-sequence logic as the success path: pass
the correct expected sequence (the same sequence value used when writing
successful history) instead of 0, capture the returned error from
recordHistoryWithPhaseStatus, and handle ErrSequenceConflict the same way the
success path does (i.e., detect sequence conflict and return/translate it to the
same conflict response or retry logic). Locate recordHistoryWithPhaseStatus,
classifyFailure, and the execResult.jobID/phaseStatus usage to implement the
change so failed/partial history writes respect optimistic locking and surface
the error.
```go
// On failure: mark phases based on step results.
// Steps are executed in phase order. We mark phases as COMPLETED until we
// find a failed step, then the containing phase is FAILED and the rest are SKIPPED.
failedPhase := findFailedPhase(plan, result)

for _, p := range phases {
	key := fmt.Sprintf("phase_%d", p)
	entry := phaseStatus[key]
	entry.StartedAt = &now

	if failedPhase > 0 && p < failedPhase {
		entry.Status = manifest.PhaseStatusCompleted
		entry.CompletedAt = &now
	} else if failedPhase > 0 && p == failedPhase {
		entry.Status = manifest.PhaseStatusFailed
		entry.CompletedAt = &now
		if result != nil {
			entry.Error = result.Error
		} else if execErr != nil {
			entry.Error = execErr.Error()
		}
	} else if failedPhase > 0 {
		entry.Status = manifest.PhaseStatusSkipped
		entry.StartedAt = nil
	} else {
		// No phase-level info available; mark all as failed
		entry.Status = manifest.PhaseStatusFailed
		entry.CompletedAt = &now
		if execErr != nil {
			entry.Error = execErr.Error()
		}
	}
	phaseStatus[key] = entry
}
```
Partial work inside the first failed phase is still reported as FAILED.
A phase can contain multiple calls. If some calls in phase 1 succeed and a later call in that same phase fails, updatePhaseStatus records only FAILED for phase_1, so classifyFailure returns FAILED because no phase is COMPLETED. That loses the "side effects already happened" signal this PR is adding.
Also applies to: 630-649
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/control-plane/internal/applier/grpc_handler.go` around lines 526 -
559, The current logic marks the entire failed phase as FAILED even if earlier
calls in that same phase succeeded; modify the update that runs when failedPhase
== p (in the loop that updates phaseStatus using findFailedPhase, and the
analogous block around classifyFailure/updatePhaseStatus later) to inspect the
step-level results (e.g., result.StepResults or whatever structure holds
per-step statuses) for the failedPhase: if any step in that phase succeeded,
treat the phase as having partial work — set the phase status to
manifest.PhaseStatusCompleted (or to an existing "partial" status if your
manifest has one), set CompletedAt = &now and populate Error from result.Error
or execErr as currently done; only mark it strictly FAILED when no step in that
phase completed successfully. Ensure you update both occurrences (the loop using
phases/phaseStatus and the similar block at lines ~630-649) and reference
findFailedPhase, phaseStatus, phases, result, execErr and classifyFailure when
making the change.
Claude Code Review

Commit:

Summary

Well-structured PR that adds per-phase execution tracking with proper CockroachDB migration splitting (ADD COLUMN / DROP CONSTRAINT / ADD CONSTRAINT across 3 files). The domain types, proto definitions, and test coverage are solid. Two areas worth addressing: duplicate conversion logic and a fragile positional correlation assumption in `findFailedPhase`.

Risk Assessment

Findings

Bot Review Notes

CodeRabbit thread 1 (…)
CodeRabbit thread 2 (…)
```go
		entry.StartedAt = nil
	} else {
		// No phase-level info available; mark all as failed
		entry.Status = manifest.PhaseStatusFailed
		entry.CompletedAt = &now
		if execErr != nil {
			entry.Error = execErr.Error()
		}
	}
	phaseStatus[key] = entry
}
```
MUST FIX: `findFailedPhase` builds a map keyed by `string(call.GRPCMethod)` (fully-qualified gRPC paths like `meridian.reference_data.v1.ReferenceDataService/RegisterInstrument`) but matches against `saga.StepResult.StepName`, which uses handler registry names (like `reference_data.register_instrument`).

These will never match in production, so `findFailedPhase` always returns 0, `updatePhaseStatus` marks all phases as FAILED (the `failedPhase == 0` else-branch), and `classifyFailure` never sees both COMPLETED and FAILED entries. The PARTIAL status is unreachable.

The unit tests mask this because they use matching short names for both `GRPCMethod` and `StepName`.

Suggested fix: Add a `HandlerName` field to `PlannedCall` that stores the registry name (e.g., `reference_data.register_instrument`), then use that field as the map key here instead of `GRPCMethod`. Also update tests to use realistic production names to prevent regression.
- Fix `findFailedPhase` to use positional correlation instead of name matching. Saga step results use handler names (e.g. `reference_data.register_instrument`) while execution plan calls use gRPC method paths. Positional index is the correct mapping.
- Fix `EntityToProto` to propagate `phase_status` deserialization errors instead of silently dropping them.
- Rewrite `StoreManifestVersionWithPhaseStatus` to include `phase_status` in the initial `Store` call (single atomic write) instead of a separate `UpdatePhaseStatus` call that could fail silently.
- Remove misleading "partial" from the executor `Status` field comment since partial classification happens in the gRPC handler, not the executor.
CockroachDB cannot drop and recreate a constraint with the same name in a single transaction. Split the apply_status constraint change into two migrations: one to drop the old constraint, one to add the new.
♻️ Duplicate comments (1)

services/control-plane/internal/applier/grpc_handler.go (1)

184-191: ⚠️ Potential issue | 🟠 Major — Optimistic locking is bypassed for failed/partial history writes.

Line 190 hardcodes `expectedSeq=0` and discards both return values, allowing stale requests to append `FAILED`/`PARTIAL` history rows even after another manifest version has already advanced the sequence. This should use the same sequence check as the success path (line 197).

🔧 Proposed fix

```diff
-	_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
+	if _, err := h.recordHistoryWithPhaseStatus(
+		ctx,
+		req.GetManifest(),
+		req.GetAppliedBy(),
+		execResult.jobID,
+		applyStatus,
+		nil,
+		req.GetExpectedSequenceNumber(),
+		execResult.phaseStatus,
+	); err != nil && errors.Is(err, manifest.ErrSequenceConflict) {
+		return nil, status.Errorf(codes.Aborted, "%v", err)
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/applier/grpc_handler.go` around lines 184 - 191, The history write on failure currently calls recordHistoryWithPhaseStatus(..., expectedSeq=0) and discards its results, which bypasses optimistic-lock checks; change that call to use the same expected sequence value used by the success path (the sequence variable passed to the success recordHistoryWithPhaseStatus call) and handle/propagate the returned (seq, err) instead of ignoring them so that stale requests cannot append FAILED/PARTIAL rows when the manifest sequence has advanced; locate the call to recordHistoryWithPhaseStatus and replace the hardcoded 0 with the shared expectedSeq variable (and check the returned error/sequence like the success path does).
🧹 Nitpick comments (2)

services/control-plane/internal/applier/grpc_handler.go (1)

652-672: Duplicate conversion logic with `manifest.phaseStatusMapToProto`.

This function duplicates the logic in `services/control-plane/internal/manifest/history.go:336-356`. Consider exposing the manifest package's `phaseStatusMapToProto` and reusing it here, or extracting to a shared location.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/applier/grpc_handler.go` around lines 652 - 672, The function phaseStatusMapToResponseProto duplicates manifest.phaseStatusMapToProto; update this code to reuse the shared conversion by either exporting manifest.phaseStatusMapToProto (rename to PhaseStatusMapToProto) or moving the conversion into a shared helper used by both packages, then replace phaseStatusMapToResponseProto's body with a call to the shared function (preserving the target type controlplanev1.PhaseStatusDetail if necessary by adapting the shared helper return type or adding a thin adapter). Ensure references to manifest.PhaseStatusMap and the existing manifest.phaseStatusMapToProto/PhaseStatusMapToProto symbol are used so logic is not duplicated.

services/control-plane/internal/manifest/history.go (1)

135-204: Consider extracting shared logic to reduce duplication.

`StoreManifestVersionWithPhaseStatus` duplicates most of the logic from `StoreManifestVersion` (manifest marshaling, diff summary generation, graph serialization, entity creation). Consider refactoring to a shared internal helper that both methods call, passing `phaseStatus` as an optional parameter.

♻️ Suggested approach

```diff
+// storeVersionInternal contains the shared logic for storing manifest versions.
+func (s *HistoryService) storeVersionInternal(
+	ctx context.Context,
+	mf *controlplanev1.Manifest,
+	appliedBy string,
+	applyJobID *uuid.UUID,
+	applyStatus ApplyStatus,
+	graph *validator.RelationshipGraph,
+	expectedSeq int64,
+	phaseStatus PhaseStatusMap,
+) (*VersionEntity, error) {
+	// ... shared marshaling, diff, graph, entity creation logic ...
+	if phaseStatus != nil {
+		if err := entity.SetPhaseStatus(phaseStatus); err != nil {
+			return nil, fmt.Errorf("failed to serialize phase_status: %w", err)
+		}
+	}
+	// ... store and return ...
+}
+
 func (s *HistoryService) StoreManifestVersion(...) (*VersionEntity, error) {
-	// ... 60+ lines of duplicated logic ...
+	return s.storeVersionInternal(ctx, manifest, appliedBy, applyJobID, status, graph, expectedSeq, nil)
 }

 func (s *HistoryService) StoreManifestVersionWithPhaseStatus(...) (*VersionEntity, error) {
-	// ... 60+ lines of duplicated logic ...
+	return s.storeVersionInternal(ctx, mf, appliedBy, applyJobID, applyStatus, graph, expectedSeq, phaseStatus)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/manifest/history.go` around lines 135 - 204, StoreManifestVersionWithPhaseStatus duplicates the manifest marshaling, diff summary generation, relationship-graph JSON serialization, and VersionEntity construction found in StoreManifestVersion; refactor by extracting a shared helper (e.g., prepareVersionEntity or buildVersionEntity) that accepts the Manifest (mf), appliedBy, applyJobID, applyStatus, graph, expectedSeq, and an optional PhaseStatusMap, performs the protojson.Marshal, calls generateDiffSummary when needed, serializes the graph, constructs and returns a fully populated VersionEntity (or an entity plus any error), and then have both StoreManifestVersion and StoreManifestVersionWithPhaseStatus call that helper and only handle repo.Store and return; ensure SetPhaseStatus remains invoked inside the helper when a non-nil phaseStatus is provided and preserve existing error handling and semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 40e51b84-28c5-4b78-bf0e-e9ca2ce074e4
⛔ Files ignored due to path filters (1)
- `services/control-plane/migrations/atlas.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (5)
- services/control-plane/internal/applier/grpc_handler.go
- services/control-plane/internal/applier/grpc_handler_test.go
- services/control-plane/internal/manifest/history.go
- services/control-plane/migrations/20260316000003_add_partial_apply_status.sql
- services/control-plane/migrations/20260316000004_add_partial_apply_status_constraint.sql
Codecov Report

❌ Patch coverage is …
Keep both UpdatePhaseStatus (this branch) and GetBySequenceNumber (from develop). Regenerate proto files for merged proto changes.
```go
// phaseStatusMapToResponseProto converts a PhaseStatusMap to proto map for the response.
func phaseStatusMapToResponseProto(ps manifest.PhaseStatusMap) map[string]*controlplanev1.PhaseStatusDetail {
	if ps == nil {
```
Suggestion: `phaseStatusMapToResponseProto` here is identical to `phaseStatusMapToProto` in `history.go`. Two copies of the same conversion logic will drift. Export `PhaseStatusMapToProto` from the manifest package and call it from both places.
```go
func findFailedPhase(plan *planner.ExecutionPlan, result *ApplyManifestResult) planner.Phase {
	if result == nil || len(result.StepResults) == 0 {
		return 0
	}

	for i, step := range result.StepResults {
		if !step.Success {
			if i < len(plan.Calls) {
				return plan.Calls[i].Phase
```
Suggestion: The positional correlation between `result.StepResults` and `plan.Calls` is correct today (the executor runs calls sequentially in plan order), but this invariant is implicit. Consider adding a brief comment explaining why positional correlation holds, so future editors know this is load-bearing.
```go
// StoreManifestVersionWithPhaseStatus saves a manifest snapshot with phase-level execution status.
// Phase status is included in the initial entity creation (single atomic write).
func (s *HistoryService) StoreManifestVersionWithPhaseStatus(
	ctx context.Context,
	mf *controlplanev1.Manifest,
	appliedBy string,
	applyJobID *uuid.UUID,
	applyStatus ApplyStatus,
	graph *validator.RelationshipGraph,
	expectedSeq int64,
	phaseStatus PhaseStatusMap,
) (*VersionEntity, error) {
	if mf == nil {
		return nil, ErrNilManifest
	}
	if appliedBy == "" {
		return nil, ErrEmptyAppliedBy
	}

	marshaler := protojson.MarshalOptions{UseProtoNames: true}
	jsonBytes, err := marshaler.Marshal(mf)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal manifest to JSON: %w", err)
	}

	var diffSummary *string
	if applyStatus == ApplyStatusApplied {
		summary, diffErr := s.generateDiffSummary(ctx, mf)
		if diffErr != nil && !errors.Is(diffErr, ErrVersionNotFound) {
			return nil, fmt.Errorf("failed to generate diff summary: %w", diffErr)
		}
		if summary != "" {
			diffSummary = &summary
		}
	}

	var graphJSON *string
	if graph != nil {
		if graphBytes, graphErr := json.Marshal(graph); graphErr == nil {
			s := string(graphBytes)
			graphJSON = &s
		}
	}

	now := time.Now().UTC()
	entity := &VersionEntity{
		ID:                uuid.New(),
		Version:           mf.Version,
		ManifestJSON:      string(jsonBytes),
		AppliedAt:         now,
		AppliedBy:         appliedBy,
		ApplyStatus:       applyStatus,
		ApplyJobID:        applyJobID,
		DiffSummary:       diffSummary,
		RelationshipGraph: graphJSON,
		CreatedAt:         now,
	}

	if phaseStatus != nil {
		if err := entity.SetPhaseStatus(phaseStatus); err != nil {
			return nil, fmt.Errorf("failed to serialize phase_status: %w", err)
		}
	}

	if err := s.repo.Store(ctx, entity, expectedSeq); err != nil {
		return nil, err
	}

	return entity, nil
}
```
Suggestion: StoreManifestVersionWithPhaseStatus is a near-identical copy of StoreManifestVersion (75 lines) differing only by the SetPhaseStatus call. Consider extracting shared entity construction into a private buildVersionEntity helper.
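A minimal sketch of the suggested extraction, using simplified stand-in types — the real `VersionEntity` has many more fields, and `buildVersionEntity` is a hypothetical name, not code from this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// VersionEntity is a simplified stand-in for the real manifest entity.
type VersionEntity struct {
	Version     string
	ApplyStatus string
	PhaseStatus *string // serialized JSON, nil when no phase data was supplied
	CreatedAt   time.Time
}

// buildVersionEntity is the shared constructor both store methods would call.
// The phaseStatus map is optional and only serialized when present, so the
// plain StoreManifestVersion path passes nil and gets identical behavior.
func buildVersionEntity(version, applyStatus string, phaseStatus map[string]string) (*VersionEntity, error) {
	e := &VersionEntity{
		Version:     version,
		ApplyStatus: applyStatus,
		CreatedAt:   time.Now().UTC(),
	}
	if phaseStatus != nil {
		b, err := json.Marshal(phaseStatus)
		if err != nil {
			return nil, fmt.Errorf("failed to serialize phase_status: %w", err)
		}
		s := string(b)
		e.PhaseStatus = &s
	}
	return e, nil
}

func main() {
	plain, _ := buildVersionEntity("v1", "APPLIED", nil)
	withPhases, _ := buildVersionEntity("v2", "PARTIAL", map[string]string{"phase_1": "COMPLETED"})
	fmt.Println(plain.PhaseStatus == nil, withPhases.PhaseStatus != nil) // true true
}
```

With this shape, both public store methods shrink to validation plus a call to the helper plus `repo.Store`, so future field additions land in one place.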
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/control-plane/internal/applier/grpc_handler.go (1)
194-216: ⚠️ Potential issue | 🟠 Major

**Persist `phase_status` on successful applies too.**

The success path still writes history through `recordHistory`, so the stored `manifest_versions` row never gets `phase_status`. `services/control-plane/internal/manifest/history.go` already added `StoreManifestVersionWithPhaseStatus`, and `EntityToProto` only emits phase data when it was persisted, so later history reads will lose the phase status you just returned in the apply response.

Suggested fix

```diff
-    snapshot, seqErr := h.recordHistory(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, manifest.ApplyStatusApplied, validationResult.graph, expectedSeq)
+    snapshot, seqErr := h.recordHistoryWithPhaseStatus(
+        ctx,
+        req.GetManifest(),
+        req.GetAppliedBy(),
+        execResult.jobID,
+        manifest.ApplyStatusApplied,
+        validationResult.graph,
+        expectedSeq,
+        execResult.phaseStatus,
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/applier/grpc_handler.go` around lines 194 - 216, The manifest version row isn't persisting phase_status on successful applies; after the successful call to recordHistory in ApplyManifest handler, call the new persistence method that includes phase info (StoreManifestVersionWithPhaseStatus) on the version store instead of the current Save so the stored manifest_versions row records execResult.phaseStatus; locate the block around recordHistory and the subsequent h.versionStore.Save call in grpc_handler.go and replace or augment that save with a call to the versionStore's StoreManifestVersionWithPhaseStatus (passing req.GetManifest(), req.GetAppliedBy(), and execResult.phaseStatus) so future history reads via EntityToProto will include the persisted phase_status.
♻️ Duplicate comments (1)
services/control-plane/internal/applier/grpc_handler.go (1)
184-190: ⚠️ Potential issue | 🟠 Major

**Use the request sequence on failed/partial history writes.**

This still hardcodes `expectedSeq=0` and drops `recordHistoryWithPhaseStatus`'s error. A stale apply can therefore append FAILED/PARTIAL history after the sequence has advanced, and `ErrSequenceConflict` never reaches the client.

Suggested fix

```diff
-    _, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
+    if _, seqErr := h.recordHistoryWithPhaseStatus(
+        ctx,
+        req.GetManifest(),
+        req.GetAppliedBy(),
+        execResult.jobID,
+        applyStatus,
+        nil,
+        req.GetExpectedSequenceNumber(),
+        execResult.phaseStatus,
+    ); seqErr != nil {
+        logger.Warn("sequence number conflict during history recording", "error", seqErr)
+        return nil, status.Errorf(codes.Aborted, "%v", seqErr)
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/applier/grpc_handler.go` around lines 184 - 190, The history write currently hardcodes expectedSeq=0 and discards errors from recordHistoryWithPhaseStatus; change the call that writes history to use the request sequence (e.g., req.GetSequence()) instead of 0, capture the returned error from recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, req.GetSequence(), execResult.phaseStatus), and propagate sequence conflicts and other errors back to the caller (detect ErrSequenceConflict and convert/return it appropriately) rather than ignoring them so stale applies cannot append FAILED/PARTIAL history.
🧹 Nitpick comments (1)
services/control-plane/internal/manifest/history.go (1)
137-204: Extract the common snapshot-building logic.
StoreManifestVersionandStoreManifestVersionWithPhaseStatusnow duplicate manifest marshaling, diff-summary generation, graph serialization, andVersionEntityconstruction. Pulling that into one helper will make future field or validation changes land in one place instead of drifting between the two write paths.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/control-plane/internal/manifest/history.go` around lines 137 - 204, StoreManifestVersion and StoreManifestVersionWithPhaseStatus duplicate manifest marshaling, diff-summary generation, graph serialization, and VersionEntity construction; extract that logic into a single helper (e.g., buildVersionEntity or buildSnapshot) that accepts (ctx, mf *controlplanev1.Manifest, appliedBy string, applyJobID *uuid.UUID, applyStatus ApplyStatus, graph *validator.RelationshipGraph, now time.Time, phaseStatus PhaseStatusMap) and returns (*VersionEntity, error). Move protojson marshaling, generateDiffSummary (only when applyStatus == ApplyStatusApplied and preserving ErrVersionNotFound handling), graph JSON serialization, and the VersionEntity field population into that helper, and call entity.SetPhaseStatus inside the helper when phaseStatus != nil; then have both StoreManifestVersion and StoreManifestVersionWithPhaseStatus call the helper and only keep repo.Store and expectedSeq handling in the callers to preserve current semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 222da673-9cff-4b18-8b76-9b304b2514ca
📒 Files selected for processing (5)

- api/proto/meridian/control_plane/v1/apply_manifest_service.proto
- api/proto/meridian/control_plane/v1/manifest_history_service.proto
- services/control-plane/internal/applier/grpc_handler.go
- services/control-plane/internal/manifest/history.go
- services/control-plane/internal/manifest/repository.go
🚧 Files skipped from review as they are similar to previous changes (1)
- api/proto/meridian/control_plane/v1/apply_manifest_service.proto
```go
now := time.Now().UTC()
phases := plan.Phases()

if execErr == nil && result != nil {
	// All phases completed successfully
	for _, p := range phases {
		key := fmt.Sprintf("phase_%d", p)
		entry := phaseStatus[key]
		entry.Status = manifest.PhaseStatusCompleted
		entry.StartedAt = &now
		entry.CompletedAt = &now
		phaseStatus[key] = entry
	}
	return
}

// On failure: mark phases based on step results.
// Steps are executed in phase order. We mark phases as COMPLETED until we
// find a failed step, then the containing phase is FAILED and the rest are SKIPPED.
failedPhase := findFailedPhase(plan, result)

for _, p := range phases {
	key := fmt.Sprintf("phase_%d", p)
	entry := phaseStatus[key]
	entry.StartedAt = &now

	if failedPhase > 0 && p < failedPhase {
		entry.Status = manifest.PhaseStatusCompleted
		entry.CompletedAt = &now
	} else if failedPhase > 0 && p == failedPhase {
		entry.Status = manifest.PhaseStatusFailed
		entry.CompletedAt = &now
		if result != nil {
			entry.Error = result.Error
		} else if execErr != nil {
			entry.Error = execErr.Error()
		}
	} else if failedPhase > 0 {
		entry.Status = manifest.PhaseStatusSkipped
		entry.StartedAt = nil
	} else {
		// No phase-level info available; mark all as failed
		entry.Status = manifest.PhaseStatusFailed
		entry.CompletedAt = &now
		if execErr != nil {
			entry.Error = execErr.Error()
		}
	}
	phaseStatus[key] = entry
}
```
**Don't backfill phase timings with one post-run timestamp.**

`updatePhaseStatus` runs only after `executor.Apply` returns, so every executed phase gets the same synthetic `now`. That does not match the `started_at` / `completed_at` contract added in `api/proto/meridian/control_plane/v1/manifest_history_service.proto` and will make the stored phase timing misleading. Either thread real per-phase timing out of the executor or leave these fields unset until that data exists.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/control-plane/internal/applier/grpc_handler.go` around lines 512 -
560, The code in updatePhaseStatus (after executor.Apply) currently backfills
StartedAt/CompletedAt with a single synthetic now for every phase (variables:
phases, phaseStatus, findFailedPhase, result, execErr), which violates the
manifest timing contract; instead, stop setting entry.StartedAt and
entry.CompletedAt to &now unless you can source real per-phase timings from the
executor result (e.g., per-step/phase timestamps on result or StepResults) and
copy those into entry.StartedAt/CompletedAt; otherwise, only set entry.Status
and entry.Error as appropriate and leave StartedAt/CompletedAt nil so inaccurate
timestamps are not recorded.
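If real per-phase timing were threaded out of the executor, a minimal recorder could look like the sketch below — `PhaseTimer` and its hooks are hypothetical names for illustration, not part of this PR:

```go
package main

import (
	"fmt"
	"time"
)

// PhaseTiming captures real per-phase timestamps as the executor runs,
// instead of backfilling one synthetic time after Apply returns.
type PhaseTiming struct {
	StartedAt   *time.Time
	CompletedAt *time.Time
}

// PhaseTimer is a hypothetical recorder the executor would invoke around each phase.
type PhaseTimer struct {
	timings map[string]*PhaseTiming
}

func NewPhaseTimer() *PhaseTimer {
	return &PhaseTimer{timings: make(map[string]*PhaseTiming)}
}

// PhaseStarted records the moment a phase actually begins executing.
func (t *PhaseTimer) PhaseStarted(key string) {
	now := time.Now().UTC()
	t.timings[key] = &PhaseTiming{StartedAt: &now}
}

// PhaseCompleted records the moment a phase finishes (success or failure).
func (t *PhaseTimer) PhaseCompleted(key string) {
	if entry, ok := t.timings[key]; ok {
		now := time.Now().UTC()
		entry.CompletedAt = &now
	}
}

func main() {
	timer := NewPhaseTimer()
	timer.PhaseStarted("phase_1")
	timer.PhaseCompleted("phase_1")
	timer.PhaseStarted("phase_2") // phase 2 fails mid-run: no CompletedAt recorded

	t1 := timer.timings["phase_1"]
	t2 := timer.timings["phase_2"]
	fmt.Println(t1.CompletedAt != nil, t2.CompletedAt == nil) // true true
}
```

The handler would then copy these timings into `PhaseStatusEntry` after the run; a phase with a `StartedAt` but no `CompletedAt` is exactly the failed or interrupted phase.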
Design suggestions acknowledged - timestamps represent recording time, optimistic locking on failure path is pre-existing behavior, and code dedup is follow-up work
Summary
`PARTIAL` apply status distinguishes partial failures (some phases succeeded) from complete failures
Migrations (CockroachDB-safe)
- `20260316000002_add_phase_status.sql`: Add `phase_status JSONB` column to `manifest_versions`
- `20260316000003_add_partial_apply_status.sql`: Alter CHECK constraint to include `PARTIAL` (separate migration for CockroachDB compatibility)

Proto
- `PhaseStatusDetail` message with status, started_at, completed_at, error fields
- `phase_status` map field on `ManifestVersion` and `ApplyManifestResponse`
- `APPLY_STATUS_PARTIAL` and `APPLY_MANIFEST_STATUS_PARTIAL` enum values

Domain (manifest package)
- `PhaseStatusMap`, `PhaseStatusEntry`, `PhaseExecutionStatus` types
- `GetPhaseStatus()` / `SetPhaseStatus()` on `VersionEntity` for JSON round-tripping
- `UpdatePhaseStatus()` repository method
- `StoreManifestVersionWithPhaseStatus()` on `HistoryService`
- `ApplyStatusPartial` constant

gRPC Handler (applier package)
- `buildInitialPhaseStatus()` derives pending phase entries from the execution plan
- `updatePhaseStatus()` marks phases as COMPLETED/FAILED/SKIPPED based on saga step results
- `classifyFailure()` distinguishes partial vs complete failure

Task Master
045-manifest-source-of-truth.11 - Add phase_status and PARTIAL apply status
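The classification rule described above — PARTIAL when at least one phase completed before the failure, FAILED otherwise — can be sketched in isolation. This is a simplified stand-in, not the real `classifyFailure` signature in `grpc_handler.go`:

```go
package main

import "fmt"

// StepResult is a simplified stand-in for a saga step outcome.
type StepResult struct {
	Phase   int
	Success bool
}

// classifyFailure mirrors the PR's described rule for a failed apply:
// if any step succeeded before the failure, the run is PARTIAL;
// if nothing succeeded, it is a complete FAILED run.
func classifyFailure(steps []StepResult) string {
	for _, s := range steps {
		if s.Success {
			return "PARTIAL"
		}
	}
	return "FAILED"
}

func main() {
	fmt.Println(classifyFailure([]StepResult{{1, true}, {2, false}})) // PARTIAL
	fmt.Println(classifyFailure([]StepResult{{1, false}}))            // FAILED
}
```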
Test plan
- `classifyFailure` (nil, all-failed, partial)
- `buildInitialPhaseStatus` from execution plan
- `updatePhaseStatus` (all success, partial failure)
- `findFailedPhase` (no result, with result)
- `phaseStatusMapToResponseProto` (nil, populated)
- `VersionEntity.GetPhaseStatus`/`SetPhaseStatus` round-trip
- `PARTIAL` apply status persists to CockroachDB
- `UpdatePhaseStatus` JSONB column round-trip